Abstract:

This paper describes a system for sharing tasks among processors on a network of personal
computers and presents an analysis of the problem of scheduling tasks on such a network. The
system, called Enterprise, is based on the metaphor of a market: processors send out "requests for
bids" on tasks to be done and other processors respond with bids giving estimated completion times
that reflect machine speed and currently loaded files. The system includes a language independent
Distributed Scheduling Protocol (DSP), and an implementation of this protocol for scheduling
remote processes in Interlisp-D. The Enterprise implementation assigns processes to the best
machine available at run-time (either remote or local) and includes facilities for asynchronous
message passing among processes. In a series of simulations of different load conditions and
network configurations, DSP was found to be substantially superior to both random assignment and
a more complex alternative that maintained detailed schedules of estimated start and finish times.
Enterprise:
A Market-like Task Scheduler for Distributed Computing Environments


Introduction

With the rapid spread of personal computer networks and the increasing availability of low cost
VLSI processors, the opportunities for massive use of parallel and distributed computing are
becoming more and more compelling. Parallel hardware can often increase overall program speed
by concurrently executing independent subparts of an algorithm on different processors [1]. In
some cases, parallel search algorithms can improve average execution time even more rapidly than
the increase in the number of processors ([24], [28], [47]). With their potential for multiple
redundancy, parallel systems can be made much more reliable than their single-processor
counterparts. There are also important cases of inherently distributed problem solving where the
initial information and resulting actions necessary to solve a problem occur in widely distributed
locations (e.g., [3], [7], [39], [48]).

Providing appropriate programming facilities for specifying, controlling, and debugging parallel
computations involves a new set of problems. For example, Jones & Schwarz [27] discuss three
main classes of problems in multiprocessing systems: (1) resource scheduling (how to allocate
processor time and memory space), (2) reliability (how to deal with various kinds of component
failures), and (3) synchronization (how to coordinate concurrent activities that depend on each
other's actions).

In this paper we describe a prototype system called Enterprise that addresses those problems and
present the results of analyzing one aspect of the problem of resource scheduling, namely the
scheduling of tasks on processors. This analysis was based on simulations of several alternative task
scheduling algorithms operating under various network configurations and load conditions.
Although we have focused on decentralized methods for scheduling tasks, our results on this topic
are applicable to many forms of parallel computation, regardless of whether or not the processors
are geographically separated and whether or not they share memory.

The Enterprise system schedules and runs processes on a local area network of high-performance
personal computers. It is implemented in Interlisp-D and runs on the Xerox 1100 (Dolphin), 1108
(Dandelion), and 1132 (Dorado) Scientific Information Processors connected with an Ethernet.
A new philosophy for distributed processing

As summarized in Table 1, the traditional philosophy used in designing systems based on local area
networks such as Ethernets [34] is to have dedicated personal workstations which remain idle when
not used by their owners, and dedicated special purpose servers such as file servers, print servers,
and various kinds of data base servers. A system like Enterprise that schedules tasks on the best
processor available at run time (either remote or local) enables a new philosophy in designing such
distributed systems. In this new philosophy, personal workstations are still dedicated to their
owners, but during the (often substantial) periods of the day when their owners are not using them,
these personal workstations become general purpose servers, available to other users on the network.
"Server" functions can migrate and replicate as needed on otherwise unused machines (except for
those such as file servers and print servers that are required to run on specific machines). Thus
programs can be written to take advantage of the maximum amount of processing power and
parallelism available on a network at any time, with little extra cost when there are few extra
machines available.

System   Architecture 

In order to use this new philosophy, at least the following three facilities must be provided: (1) a
way of communicating between processes on different machines, (2) a way of scheduling tasks on the
best available machines (either remote or local), and (3) programming language constructs for
dealing with remoteness.

As shown in Figure 1, the Enterprise system provides these facilities in three layers of software.
The first layer provides an Inter-Process Communication (IPC) facility by which different processes,
either on the same or different machines, can send messages to each other. When the different
processes are on different machines, the IPC protocol uses internetwork datagrams called PUPs (see
[4]) to provide reliable non-duplicated delivery of messages over a "best efforts" physical transport
medium such as an Ethernet [34]. Enterprise uses a pre-existing protocol that is highly optimized
for remote procedure calls ([2], [42]) in which messages are passed to remote machines as procedure
calls on the remote machines.

The next layer of the Enterprise system is the Distributed Scheduling Protocol (DSP) which, using
the IPC, locates the best available machine to perform a task (even if the best machine for a task
turns out to be the one on which the request originated). Finally, the top layer is a Remote Process
Mechanism, which uses both the DSP and IPC to create processes on different machines that can
communicate with each other.


Table 1
A new distributed processing philosophy

1. Traditional distributed processing philosophy
   a. dedicated personal workstations
   b. unused workstations are idle
   c. dedicated special purpose servers

2. New distributed processing philosophy
   a. dedicated personal workstations
   b. unused workstations are general purpose servers
   c. special purpose servers only where required by hardware

Figure 1. Protocol layers used in the Enterprise system.

One of the most obvious solutions to the task scheduling problem is to simply assign a task to the
first available processor. This approach, or one similar to it, is taken in many existing distributed
systems (e.g., [2]). As we will see in more detail below, there are three important conditions under
which such a simple approach may be radically sub-optimal:

(1) when processors have different capabilities (e.g., different speeds or different files
loaded into virtual memory),

(2) when tasks have different priorities (e.g., some subtasks in a search problem are more
promising than others), and

(3) when network-wide demand for processor time is high.

The problem of scheduling tasks on processors is, of course, a well-known problem in traditional
operating systems and scheduling theory, and there are a number of mathematical and software
techniques for solving this problem in both single and multi-processor systems (e.g., [5], [8], [9], [27],
[29], [30], [44]). Almost all the traditional work in this area, however, deals with centralized
scheduling techniques, where all the information is brought to one place and the decisions are made
there. In highly parallel systems where the information used in scheduling and the resulting actions
are distributed over a number of different processors, there may be substantial benefits from
developing decentralized scheduling techniques [33]. For example, when a centralized scheduler
fails, the entire system is brought to a halt, but systems that use decentralized scheduling techniques
can continue to operate with all the remaining nodes. Furthermore, much of the information used
in scheduling is inherently distributed and rapidly changing (e.g., momentary system load). Thus,
decentralized scheduling techniques can "bring the decisions to the information" rather than having
to constantly transmit the information to a centralized decision maker.

Overview  of  the  Distributed  Scheduling  Protocol 

Since many of the problems of coordinating concurrent computer systems are isomorphic to
problems in coordinating human organizations ([7], [13], [31], [39]), we have explicitly drawn upon
organizational metaphors in the design of the Enterprise scheduling mechanism. In particular, the
system's Distributed Scheduling Protocol (DSP) is based on the metaphor of a market (similar to
the contract net protocol of Smith and Davis [11], [38], [39] and the scheduler in the Distributed
Computing System ([12])). We use the term clients for the machines with tasks to be done and
contractors for the remote server machines (as in [13]). The essential idea is that a client requests
bids on a task he wants done, contractors who can do the task respond with bids, and the client
selects a contractor from among the received bids. Figure 2 illustrates the messages used in this
scheduling process. In the standard case, the following steps occur:


Figure 2. Messages in the Distributed Scheduling Protocol (typical message sequence and optional
messages exchanged between the client machine and the contractor machine).


1. The client broadcasts a "request for bids". The request for bids includes the priority of
the task, any special requirements, and a summary description of the task that allows
contractors to estimate its processing time.

2. Idle contractors respond with "bids" giving their estimated completion times. Busy
contractors respond with "acknowledgements" and add the task to their queues (in order of
priority).

3. When a contractor becomes idle, it submits a bid for the next task on its queue.

4. When the client receives a bid, it sends the task to the contractor who submitted the bid.
If a later bid is significantly better than the early one, the client cancels the task on the first
bidder and sends the task to the later bidder. If the later bid is not significantly better (or
if the task has side-effects and cannot be restarted), the client sends a cancel message to the
later bidder.

5. When a contractor finishes a task, it returns the result to the client.

6. When a client receives the result of a task, it broadcasts a "cancel" message so that all
the contractors can remove the task from their queues.

If a task takes much longer than it was estimated to take, the contractor aborts the task and notifies
the client that it was "cut off." This cutoff feature prevents the possibility of a few people or tasks
monopolizing an entire system.
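
The bidding cycle above can be sketched in a few lines of Python. This is only a minimal,
single-process illustration, not the Interlisp-D implementation. The names Contractor,
request_bids, and choose_contractor are hypothetical, and the bid formula (estimated task size
divided by machine speed) is an assumption made for the example. For brevity the sketch collects
all replies before choosing, whereas the protocol as described sends the task to the first bidder
and switches only if a later bid is significantly better.

```python
from dataclasses import dataclass, field
import heapq


@dataclass
class Contractor:
    name: str
    speed: float                  # relative processing speed (assumed unit)
    busy: bool = False
    queue: list = field(default_factory=list)  # heap of (-priority, task name)

    def respond(self, task_name, priority, est_size):
        """Step 2: idle contractors bid; busy ones acknowledge and queue the task."""
        if self.busy:
            heapq.heappush(self.queue, (-priority, task_name))
            return ("ack", self.name, None)
        # Assumed bid formula: estimated completion time = size / speed.
        return ("bid", self.name, est_size / self.speed)


def request_bids(contractors, task_name, priority, est_size):
    """Step 1: the client 'broadcasts' a request for bids to every contractor."""
    return [c.respond(task_name, priority, est_size) for c in contractors]


def choose_contractor(replies):
    """Step 4 (simplified): pick the bid with the earliest estimated completion."""
    bids = [(t, name) for kind, name, t in replies if kind == "bid"]
    return min(bids)[1] if bids else None


contractors = [Contractor("slow", 1.0), Contractor("fast", 4.0),
               Contractor("busy", 8.0, busy=True)]
replies = request_bids(contractors, "task-1", priority=5, est_size=8.0)
winner = choose_contractor(replies)
```

Here the busy machine is the fastest, but because it only acknowledges the request, the client
chooses among the machines that are ready to start.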

Protection against processor failure. In addition to this bidding cycle, clients periodically query the
contractors to which they have sent tasks about the status of the tasks. If a contractor fails to
respond to a query (or any other message in the DSP), the client assumes the contractor has failed.
Failures might result from hardware or software malfunctions or from a person preempting a
machine for other uses. In any case, unless the task description specifically prohibits restarting
failed tasks, the client automatically reschedules the task on another machine. Similarly, if a
contractor fails to receive periodic queries from one of its clients, the contractor assumes the client
has failed and the contractor aborts that client's task.
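
The client side of this failure check might look roughly as follows. The function name
tasks_to_reschedule, the data layout, and the explicit timeout parameter are assumptions for
illustration; the length of the query interval is not specified here.

```python
def tasks_to_reschedule(outstanding, last_reply, now, timeout):
    """Return restartable tasks whose contractor has been silent longer than `timeout`.

    outstanding: dict mapping task -> (contractor, restartable flag)
    last_reply:  dict mapping contractor -> time of the last message received from it
    """
    failed = {c for c, t in last_reply.items() if now - t > timeout}
    return sorted(task for task, (c, restartable) in outstanding.items()
                  if c in failed and restartable)


outstanding = {"t1": ("A", True), "t2": ("B", True), "t3": ("B", False)}
last_reply = {"A": 95.0, "B": 60.0}
to_restart = tasks_to_reschedule(outstanding, last_reply, now=100.0, timeout=30.0)
```

In this example contractor B is presumed to have failed, but only the task that permits
restarting is rescheduled.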

The "remote or local" decision. In the protocol as described above, if the local machine submits
bids for its own tasks (i.e., the client machine offers to be its own contractor), then the local
machine would presumably always be the first bidder and would therefore receive every task. To
prevent this from happening, the client waits for other bids during a specified interval before
processing its own bid. Since contractor machines are assumed to be processing tasks for only one
user at a time, the client machine's own bid is also inflated by a factor that reflects the current load
on the client machine. (Human users of a processor can express their willingness to have tasks
scheduled locally by setting either of these two parameters.)
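
As a small sketch of this decision, assume the waiting interval has already elapsed and the
remote bids that arrived during it are held in a dictionary. The function name pick_machine and
the treatment of the load adjustment as a simple multiplier are assumptions of the example.

```python
def pick_machine(local_estimate, load_factor, remote_bids):
    """Choose between the local machine and the remote bidders.

    remote_bids: dict mapping contractor -> estimated completion time, holding
    the bids that arrived during the (assumed) waiting interval.
    """
    candidates = dict(remote_bids)
    candidates["local"] = local_estimate * load_factor  # inflated local bid
    return min(candidates, key=candidates.get)


choice_idle = pick_machine(10.0, 1.0, {"r1": 12.0})     # lightly loaded client
choice_loaded = pick_machine(10.0, 2.0, {"r1": 12.0})   # heavily loaded client
```

With a load factor of 1.0 the task stays local; doubling the factor makes the remote bid the
better one.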


Comparison with the contract net protocol. Like the contract net protocol ([11], [38], [39]), DSP uses
an "announcement, bid, award" sequence. This allows for mutual selection of clients and
contractors; that is, contractors choose which clients to serve and clients choose which contractors to
use. It also allows for anonymous invocation where programmers merely describe the requirements
of a task without specifying which processor is to perform the task.

The most important difference between DSP and contract nets is that DSP restricts the basis for
mutual selection by clients and contractors to two primary dimensions: (1) contractors select
clients' tasks in the order of numerical task priorities, and (2) clients select contractors on the basis
of estimated completion times (from among the contractors that satisfy the minimum requirements to
perform the job). As we will see below, this specialization of the contract net protocol allows us,
among other things, to make some very nice connections with results from traditional scheduling
theory about optimality.

Remote Processes Rather Than Remote Procedures

One of the most important issues in designing a system for distributed parallel processing is the
choice of language constructs available to the programmer. Most of the proposals for such
constructs fall into one of two general classes [35]: (1) procedure-call constructs ([2], [35], [46]), and
(2) message-passing constructs ([19], [20], [25]). In most (but not all) cases, message passing systems
assume that the objects that receive and send messages are concurrent processes with separate
"threads" of control, while procedure call systems include the notion of transfer of control from the
caller to the callee.

Though there are certainly situations in which remote procedures are a useful programming
construct (e.g., we used them as the basis for our message passing primitive), we believe that
processes are a more appropriate and powerful abstraction for programming many distributed
computations. First, remoteness is inherently parallel. Thus, it is appropriate to use the same
language constructs for remoteness as for parallelism. For example, process mechanisms typically
provide facilities for interprocess communication and for user interaction with processes (e.g.,
deleting a running process, querying the status of a process, providing input for a process, or
receiving output from a process). Those facilities, extended to include remote processes, are crucial
to the programmer's ability to specify, debug, and control distributed computations. In addition,
they allow a program to be designed without concern for the physical location of each process
activation. Second, if a system provides remote processes, it is a trivial matter for users to
implement their own remote procedures on top of remote processes, but the reverse is certainly not
true. The primary argument in favor of remote procedures seems to be that their simplicity allows
them to be implemented more efficiently than remote processes. However, the control and
communication facilities available for remote processes make them the construct of choice in most
situations.
DSP is only one of a large class of potentially effective task scheduling protocols, and it is not
immediately clear which of the possible schedulers is most desirable. For example, a protocol that
allows a contractor to bid on one task while performing another one might be more efficient than
DSP. Or, random scheduling might be almost as good as a sophisticated scheduler in most
commonly occurring situations. In the development of DSP we analyzed the objectives of the task
scheduler and used a simulation program to compare the performance of promising alternative
scheduling protocols. In this section we present the results of that study.

Global  Scheduling  Objectives 

Traditional schedulers for centralized computing systems often use list scheduling as a basis for
layering the design of a system (e.g., [8]). In this approach, one level of the system sequences jobs
according to their order in a priority list while the policy decisions about how priorities are assigned
are made at a higher level in the system (see [30]). DSP allows precisely the same kind of
separation of policy and mechanism. The DSP protocol itself is concerned only with sequencing
jobs according to priorities assigned at some higher level. By assigning these priorities in different
ways, the designers of distributed systems can achieve different global objectives. For example, it is
well known that in systems of identical processors, the average waiting time of jobs is minimized by
doing the shortest jobs first [9]. Thus, by assigning priorities in order of job length (shortest first),
the completely decentralized decisions based on priority result in a globally optimal sequencing of
tasks on processors.
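
A small worked example makes the shortest-jobs-first claim concrete. The sketch below considers
a single processor and three jobs that are all available at time zero; the job lengths and the
helper name mean_wait are assumptions chosen for illustration. It compares the shortest-first
order against every other ordering.

```python
from itertools import permutations


def mean_wait(order):
    """Mean time jobs spend waiting before they start on one processor."""
    clock, total = 0.0, 0.0
    for length in order:
        total += clock      # this job waited until `clock` before starting
        clock += length
    return total / len(order)


jobs = [3.0, 1.0, 2.0]                  # job lengths in submission order
spt = sorted(jobs)                      # shortest jobs first
best = min(mean_wait(p) for p in permutations(jobs))
```

Running the jobs in submission order gives waits of 0, 3, and 4 time units, while the
shortest-first order gives waits of 0, 1, and 3, which is the smallest mean over all six
orderings.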

Optimality results for mean flow time and maximum flow time. Traditional scheduling theory (e.g.,
[9]) has been primarily concerned with minimizing one of two objectives: (1) the average flow
time of jobs (F_ave), the average time from availability of a job until it is completed, and (2) the
maximum flow time of jobs (F_max), the time until the completion of the last job. Minimizing F_max
also maximizes the utilization of the processors being scheduled [8]. (A third class of results from
scheduling theory, involving the "tardiness" of jobs in relation to their respective deadlines, appears
to be less useful in most computer system scheduling problems.) The most general forms of both
these problems are NP-complete ([6], [23]), so much of the literature in this field has involved
comparing scheduling heuristics in terms of bounds on computational complexity and "goodness" of
the resulting schedules relative to optimal schedules (e.g., [10], [26]).

A number of results suggest the value of using two simple heuristics, shortest processing time first
(SPT) and longest processing time first (LPT), to achieve the objectives F_ave and F_max, respectively.
First we consider cases where all jobs are available at the same time and their processing times are
known exactly. In these cases, if all the processors are identical, then SPT exactly minimizes F_ave
[9] and LPT is guaranteed to produce an F_max that is no worse than 4/3 of the minimum possible
value [17]. If some processors are uniformly faster than others, then the LPT heuristic is guaranteed
to produce an F_max no worse than twice the best possible value [10]. Next, we consider cases where
all jobs are available at the same time but their exact processing times are not known in advance.
Instead the processing times have certain random distributions (e.g., exponential) with different
expected values for different jobs. In these cases, if the system contains identical processors on
which preemptions and sharing are allowed, then SPT and LPT exactly minimize the expected
values of F_ave and F_max, respectively ([45], [15]). Finally, we consider cases where the jobs are not
all available at the same time but instead arrive randomly and have exponentially distributed
processing times. In these cases, if the processors are identical and allow preemption, then LPT
[...] [43].
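
The 4/3 bound for LPT on identical processors can be illustrated with a short sketch. The
function name lpt_makespan and the job lengths are assumptions chosen for the example: the
function assigns jobs, longest first, to whichever processor currently has the least work, and
returns the resulting maximum flow time when all jobs are available at time zero.

```python
import heapq


def lpt_makespan(jobs, m):
    """Finish time of the last job under longest-processing-time-first list scheduling.

    Each job goes to the currently least-loaded of m identical processors.
    """
    loads = [0.0] * m
    heapq.heapify(loads)
    for length in sorted(jobs, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + length)
    return max(loads)


jobs = [3, 3, 2, 2, 2]
lpt = lpt_makespan(jobs, m=2)
optimal = 6.0   # by hand: one processor runs 3 + 3, the other 2 + 2 + 2
```

For these five jobs on two processors LPT finishes at time 7 while the best possible schedule
finishes at time 6, a ratio of 7/6 that lies within the guaranteed factor of 4/3.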

Other scheduling objectives. DSP can be used to achieve many other possible objectives besides the
traditional ones of minimizing mean or maximum flow time for independent jobs. For example:

(1) Parallel heuristic search. Many artificial intelligence programs use various kinds of
heuristics for determining which of several alternatives in a search space to explore next.
For example, in a traditional "best first" heuristic search, the single most promising
alternative at each point is always chosen to be explored next [36]. By using the heuristic
evaluation function to determine priorities for DSP, a system with n processors available can
be always exploring the n most promising alternatives rather than only one. Furthermore,
if the processors have different capabilities, each task will be executing on the best
processor available to it, given its priority.

(2) Arbitrary market with priority points. Another obvious use of DSP is to assign each
human user of the system a fixed number of priority points in each time period. Users (or
their programs) can then allocate these priority points to tasks in any way they choose in
order to obtain the response times they desire (see [41] for a similar, though non-automated,
scheme).

(3) Incentive market with priority points. If the personal computers on a network are
assigned to different people, then a slight modification of the arbitrary market in (2) can be
used to give people an incentive to make their personal computers available as contractors.
In this modified scheme, people accumulate additional priority points for their own later
use, every time their machine acts as a contractor for someone else's task.
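
To illustrate objective (1) above, the following sketch takes the n most promising alternatives
from a search frontier at once, as n available processors would. The helper name next_batch and
the convention that a lower heuristic score means a more promising alternative are assumptions
of the example.

```python
import heapq


def next_batch(frontier, n):
    """Pop up to n of the most promising nodes (lowest heuristic score first)."""
    return [heapq.heappop(frontier) for _ in range(min(n, len(frontier)))]


# (heuristic score, node name) pairs; the scores are made up for the example.
frontier = [(7, "g"), (2, "b"), (5, "e"), (1, "a")]
heapq.heapify(frontier)
batch = next_batch(frontier, 3)   # work for three processors
```

With three processors the alternatives scored 1, 2, and 5 are explored concurrently, and the
one scored 7 remains on the frontier.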

Alternative  Scheduling  Protocols 

For comparison purposes, consider the following two alternative protocols. The first protocol is a
scheme designed to be the most logical extension of the traditional techniques from scheduling
theory (e.g., [9]). (In fact, it is the protocol we initially thought would provide the best
performance.) The second is a random assignment method that provides a comparison with designs
where no attention is given to the scheduling decision.


Eager assignment

The first alternative protocol overcomes the possible deficiency of DSP that no estimates of
completion times are provided by processors that are not ready to start immediately. That is, clients
using DSP may start a task on a machine that is available immediately (possibly their own local
machine), only to find that another much faster machine becomes available soon. If the task is
canceled and restarted, all the processing time on the first machine is wasted. If not, the job
finishes later than it could have. If reasonable estimates of completion times on currently busy
machines could be made, then clients would know enough to wait for faster machines that were not
immediately available. These estimates might also be useful to the human users who, when DSP is
used, have no idea of when their tasks will finish until the tasks actually begin execution.

In this alternative, tasks are assigned to contractors as soon as possible after the tasks become
available and then reassigned as necessary when conditions change. In this way, each contractor
maintains a schedule of tasks it is expected to do, along with their estimated start and finish times,
and so the contractor can make estimates of when it could complete any new task that is submitted.
By analogy with "lazy" evaluation of variables ([14], [18]) the original DSP could be called "lazy
assignment" because clients defer assigning a task to a specific contractor until the contractor is
actually ready to start. This alternative protocol, therefore, will be called "eager assignment," since
it assigns tasks to contractors as soon as possible.

In this protocol, all contractors bid on all tasks even if they are currently busy. A contractor
estimates its starting time for a task by finding the first time in its schedule at which no task of
higher priority is scheduled. Then the client picks the best bid and sends the task to the contractor
who submitted it. When new tasks are added to a contractor's schedule, or when a task takes
longer than expected to complete, the contractor notifies the owners of later tasks in its schedule
that their reservations have been "bumped." These clients may then try to reschedule their tasks on
other contractors.
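
A contractor's start-time estimate under eager assignment might be computed roughly as follows.
The representation of the schedule as a list of (start, finish, priority) tuples sorted by
start time, and the function name estimated_start, are assumptions made for this sketch.

```python
def estimated_start(schedule, priority, now=0.0):
    """First time at or after `now` when no higher-priority task is scheduled.

    schedule: non-overlapping (start, finish, priority) tuples sorted by start time.
    """
    t = now
    for start, finish, prio in schedule:
        if prio > priority and start <= t < finish:
            t = finish   # a higher-priority task occupies this time; skip past it
    return t


schedule = [(0, 4, 9), (4, 6, 7), (6, 9, 3)]
start = estimated_start(schedule, priority=5)
```

In this example a new task of priority 5 would start at time 6, behind the tasks of priority 9
and 7, and the task of priority 3 that had reserved that slot would be bumped.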

It is important to note that even in cases where there is a lot of bumping, this scheduling process is
guaranteed to converge. Since tasks can only bump the reservations of other tasks of lower priority,
the scheduling of a new task can never cause more than a finite number of bumps. To reduce the
finite (but possibly large) amount of rescheduling in rapidly changing situations, rescheduling can
occur only when a task is bumped by a large amount.

Random  assignment 

In the second alternative protocol, clients pick the first contractor who responds to their request for
bids and contractors pick the first tasks they receive after an idle period. Contractors do not bid at
all when they are executing a task, and they answer all requests for bids when they are idle. If a
client does not receive any bids, it continues to rebroadcast the request for bids periodically. When
contractors receive a task after already beginning execution of another one, the new task is rejected
(with a "bump" message) and the client who submitted it continues trying to schedule it elsewhere.
In the simulations discussed below, the selection of the first bidders when more than one machine is
available, and of the first task when more than one task is waiting, are both modeled as random
choices since the delay times for message transmission and processing are presumably random. (In
reality, fast contractor machines might often respond more quickly to requests for bids than slow
ones and so would be more likely to be the first bidders. Thus the performance of this scheduling
mechanism in a real system might be somewhat better than the simulated performance.)

Simulation Results: Minimizing Mean Flow Time

Since minimizing the mean flow time of independent jobs is likely to be the primary objective in
many real distributed scheduling environments, and since analytic results about this topic are so
scarce, it is appropriate to use simulations to compare alternative strategies for achieving this
objective. In this section we summarize the results of a series of simulation studies of the three
distributed scheduling alternatives outlined above: (1) lazy assignment (the original DSP), (2) eager
assignment, and (3) random assignment. In both the eager and lazy alternatives, priorities are
determined according to the shortest processing time first (SPT) heuristic. In the random
alternative, priorities are not used. These simulation studies are described in more detail by
Howard [21].

Method 

To simulate the performance of the alternative scheduling strategies, most of the code for the
operational Enterprise system was used, with some of the functions (primarily those for sending
messages between machines) redefined so that a complete network was simulated on one machine.
The completion of jobs was simulated using elapsed time of the simulation clock, with faster
machines completing simulated jobs proportionally more quickly than slow ones.

Job loads. For all the simulations, "scripts" of random job submissions were created. All jobs were
assumed to be independent of each other and required to be run remotely. The job arrivals were
assumed to be a Poisson process and the amount of processing in each job was assumed to be
exponentially distributed. This means that, at every instant, there was a constant probability that a
new job would arrive in the system, and also that, for each job currently executing, there was a
constant probability that, at any instant, it would end. By varying the parameters of the random
number generators, we created scripts with average loads of 0.1, 0.5, and 0.9, where average load is
defined as the expected amount of processing requested per time interval divided by the total
amount of processing power in the system. Thus, 0.1 represents a light load and 0.9 a heavy load.

Estimation errors. In addition to the actual amount of processing in each job, the scripts also
included an estimated amount of processing for each job (i.e., the estimate a user might have made
of how long the job would take). In order to examine extreme cases, these estimates were either
perfect (0 percent error) or very inaccurate (± 100 percent error). In the case of inaccurate
estimates, the errors were uniformly randomly distributed over the range.
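
As a minimal sketch of how such a script could be generated (this is not the code used in the
study), the following Python fragment draws exponential interarrival gaps and exponential
processing amounts, then attaches an estimate with a uniformly distributed error. The names
make_script, mean_work, and max_error, the mean job size of 10.0, and the choice to express the
error relative to the actual amount are all assumptions made for illustration.

```python
import random

def make_script(load, max_error=0.0, total_power=8.0, mean_work=10.0,
                n_jobs=1000, seed=0):
    """Return a list of (arrival_time, actual_work, estimated_work) tuples."""
    rng = random.Random(seed)
    # load = arrival_rate * mean_work / total_power, solved for arrival_rate
    arrival_rate = load * total_power / mean_work
    now = 0.0
    script = []
    for _ in range(n_jobs):
        now += rng.expovariate(arrival_rate)      # Poisson arrivals
        work = rng.expovariate(1.0 / mean_work)   # exponential job size
        # assumed error model: uniform relative error in [-max_error, +max_error]
        estimate = work * (1.0 + rng.uniform(-max_error, max_error))
        script.append((now, work, estimate))
    return script
```

Calling make_script(0.9, max_error=1.0) would correspond to a heavy load with very inaccurate
estimates; reusing the same seed reproduces the same script.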

Machine configurations. Nine different configurations of machines on the network were defined. In
all configurations, a total of 8 units of processing power was available, but in different cases this was
achieved in different ways: a single machine of speed 8; or 8 machines of speed 1; or 1 machine of
speed 4 and 2 machines of speed 2; etc.

Communications. In order to simulate "pure" cases of the different scheduling mechanisms,
communication among machines was assumed to be perfectly reliable and instantaneous. In real
situations where communication delays are negligible relative to job processing times, this
assumption of instantaneous communications is appropriate. In other cases, different assumptions
about communications delays might change the trade-offs among scheduling mechanisms.

Bumping, restarting after late bids, and rebroadcasting. In keeping with the spirit of simulating
"pure" scheduling methods, jobs in the eager simulations are rescheduled every time their scheduled
start time is delayed at all. In a real system, jobs would ordinarily have to be bumped by more
than some tolerance before being rescheduled. (After a job begins execution, it is not subject to
being bumped.) In the lazy simulations, jobs are never restarted when late bids are received from
fast contractors. In other words, the performance of the eager method could only get worse if fewer
bumps were made, but the performance of the lazy method might improve if jobs were sometimes
restarted after receiving late bids.

Similarly, in the random assignment simulations, clients rebroadcast their requests for bids in every
time interval of the simulation until the job is successfully assigned to a contractor. Thus, this
simulates the best scheduling performance the random method could achieve; if rebroadcasting
occurred less often, the performance could only get worse.

Replications. For each load average, five different random scripts of job submissions were
generated. Then these same five scripts were used for each of the three scheduling methods, each
of the two accuracies (0 and ± 100 percent), and each of the nine machine configurations. By
using the same scripts for all the different methods, accuracies, and configurations, we obtained a
much more powerful comparison of the differences due to the factors in which we were interested
than if the job submissions had been generated randomly in each different case.
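
The crossed design described above can be enumerated with a short sketch. The dictionary keys,
the seed numbering, and the integer placeholders for the nine configurations are hypothetical;
only the factor levels come from the text.

```python
from itertools import product

LOADS = (0.1, 0.5, 0.9)
SCRIPTS_PER_LOAD = 5
METHODS = ("lazy", "eager", "random")
ERRORS = (0.0, 1.0)          # 0 and +/- 100 percent estimation error
CONFIGS = tuple(range(9))    # placeholders for the nine machine configurations

def experiment_runs():
    """List every simulation run; each script seed is reused across all other factors."""
    runs = []
    for load_index, load in enumerate(LOADS):
        for script_no in range(SCRIPTS_PER_LOAD):
            seed = load_index * SCRIPTS_PER_LOAD + script_no   # one seed per script
            for method, error, config in product(METHODS, ERRORS, CONFIGS):
                runs.append({"load": load, "seed": seed, "method": method,
                             "error": error, "config": config})
    return runs
```

Because every method, accuracy, and configuration sees the same 15 scripts, differences between
runs that share a seed reflect the factors of interest rather than differences in the random job
submissions.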


The results presented below are averages of mean flow time for jobs over the scripts and over
several machine configurations in each case. There are three configurations of multiple identical
machines, five configurations of multiple non-identical machines, and one configuration of a single
machine. Strictly speaking, since the different configurations are not representative of any particular
population and since there are large differences between different configurations in the same
category, one should be hesitant about averaging them in this way. Therefore, we also normalized
the flow times for the lazy and eager methods in each configuration by dividing by the flow time of
the random method. The averages of these normalized values showed exactly the same relative
patterns as the averages of the original flow times, so the original flow times are shown here.
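
The normalization step amounts to the following arithmetic, shown here as a small sketch. The
function name and the input layout are assumptions, and any numbers passed to it in an example
would be invented rather than taken from the study.

```python
def normalized_averages(flow_times):
    """Average lazy/random and eager/random flow-time ratios over configurations.

    flow_times maps a configuration name to a dict with the mean flow time
    of each method: {"lazy": t, "eager": t, "random": t}.
    """
    totals = {"lazy": 0.0, "eager": 0.0}
    for times in flow_times.values():
        for method in totals:
            # a ratio below 1.0 means the method beat random assignment
            totals[method] += times[method] / times["random"]
    return {method: total / len(flow_times) for method, total in totals.items()}
```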

Effect of scheduling method. Somewhat surprisingly, Figure 3 shows that the "lazy" assignment
method is at least as good as, and in some cases much better than, the more complicated and
expensive "eager" assignment method.

Effect of system load. With perfect estimates of processing amounts, both eager assignment and lazy
assignment are consistently better than random assignment. With heavy loads, these differences are
substantial (5%-50%). Even with light loads in the case of non-identical processors, there is
approximately a 10% advantage for either of the two non-random assignment methods. This sizable
advantage under light loads appears to be because the two non-random assignment methods
consistently assign jobs to the faster machines while the random method does not. At light loads, a
fast machine is almost always available so this is a significant advantage. At moderate loads, fast
machines are often busy so this assignment preference does not matter nearly as much.

Effect of machine configuration. There are three important effects of machine configuration: (1) As
just discussed, there is a sizable advantage of the non-random scheduling methods at light loads
with non-identical machines. (2) As Figure 4 shows, the benefits of non-random scheduling
(averaged over all loads) increase as the range of processor speeds in the network increases. (3)
There is a clear advantage at all load levels of having one fast machine rather than a combination of
slower machines with the same total processing power. This last result can also be derived
analytically (see [33]).
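
The derivation in [33] is not reproduced here. As an illustration only, the standard M/M/c
queueing formulas give the same ordering under the Poisson arrival and exponential processing
assumptions used for the job scripts, with the added assumption of first-come-first-served service
from a single shared queue (the simulations use priorities instead). The function names and the
mean job size of 10.0 are hypothetical.

```python
from math import factorial

def erlang_c(c, a):
    """Probability that an arriving job must wait in an M/M/c queue (offered load a < c)."""
    rho = a / c
    top = a ** c / factorial(c) / (1.0 - rho)
    bottom = sum(a ** k / factorial(k) for k in range(c)) + top
    return top / bottom

def mean_flow_time(c, speed, arrival_rate, mean_work):
    """Mean time in system for c identical machines of the given speed."""
    mu = speed / mean_work               # service rate of one machine
    a = arrival_rate / mu                # offered load in units of machines
    if a >= c:
        raise ValueError("system is overloaded")
    wait = erlang_c(c, a) / (c * mu - arrival_rate)
    return wait + 1.0 / mu               # queueing delay plus service time
```

Comparing mean_flow_time(1, 8.0, rate, 10.0) with mean_flow_time(8, 1.0, rate, 10.0) at arrival
rates corresponding to loads of 0.1, 0.5, and 0.9 shows the single fast machine ahead in each case.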

Effect of accuracy of processing time estimates. The lazy assignment method is quite robust in the
face of poor processing time estimates, with its overall performance degrading only a few percent
even when the estimates are off by as much as 100 percent. The eager assignment method, on the
other hand, often does much worse with bad estimates than with good ones, and in some cases it
even does worse with bad estimates than purely random assignment.

Amount of message traffic. Figure 5 shows the number of directed and broadcast messages used by
the two non-random scheduling methods with perfect processing time estimates for jobs. While the
lazy method requires almost the same number of messages per job at both heavy and light loads,
the eager method requires many more messages as loads increase. Though they are not shown here,
similar results were obtained for poor estimates of job processing times (± 100% error). In these
"pure" simulations, communication was treated as instantaneous and free in order to maximize
scheduling performance. In designing an actual system, one would presumably sacrifice some
scheduling performance in order to reduce the number of messages.

Discussion 

The similarities of this scheduling problem to that of job shop scheduling might lead one to believe
that obvious extensions of traditional scheduling theory algorithms such as the eager assignment
method would provide the best performance for scheduling of distributed tasks. Also, early
simulation results by Conway, Maxwell, and Miller [9] of a job shop environment showed that
schedules based on very inaccurate estimates performed almost as well as schedules based on perfect
estimates and much better than random schedules. Why, then, should the eager assignment method
do so poorly in our examples when it is given inaccurate processing time estimates? We believe
that two primary factors account for this result:

(1) "Stable world illusion." In the eager assignment method, each job is assigned to the
machine that estimates the soonest completion time. But if jobs of higher priority arrive
later and are assigned to the same machine, then they will keep "bumping" the first job
back to later and later times. In other words, jobs are assigned to machines on the
assumption that no more jobs will arrive (i.e., that the world will remain stable). Even
though in the simulation, jobs are rescheduled every time any new jobs arrive that delay
their estimated start time, by the time the job is rescheduled, it may already have missed a
chance to start on another machine that could have completed it before it will now be
completed.

In  some  of  our  simulations  (not  included  here),  the  bids  included  an  extra  factor  to  correct 
for  this  effect,  that  is,  bids  included  an  estimate  of  how  long  the  starting  time  of  the  job 
would  be  delayed  by  jobs  that  had  not  yet  arrived,  but  could  be  expected  to  arrive  before 
the  job  began  execution.  (See  [28]  for  the  derivation  of  this  correction  factor.)  Even 
though  the  inclusion  of  this  correction  factor  did  improve  the  performance  of  the  eager 
assignment  method  somewhat,   the  changes  were  not  substantial. 

(2) Unexpected availability. When a job takes longer than expected, or when higher
priority jobs arrive at a processor, all the clients who submitted jobs scheduled later on that
processor are notified with "bump" messages and given a chance to reschedule their jobs.
When a job takes less time than expected or when jobs scheduled on a processor are
canceled, the processor may become available sooner than expected, but in these cases, the
clients who submitted jobs that were scheduled elsewhere but who might now want to
reschedule on the newly available machine are never notified. There can thus be situations
where fast processors are idle while high priority jobs wait in queues on slower processors.
This appears to be a serious weakness of the eager assignment method. We have specified,
but not implemented, an addition to the protocol that notifies all clients of such situations
and allows them to reschedule their tasks. The cost of this addition would be even greater
message traffic and system complexity, and we believe it unlikely that the resulting
performance would be significantly better than the much simpler lazy assignment method.

Implications. The most important and surprising result of these simulations is the superiority of the
lazy assignment method. Even though the lazy method is simpler to implement and often requires
substantially less message traffic, it never performs much worse than the alternative methods, and it
usually performs much better.

There are two important qualifications on the generality of these results: First, the introduction of
communication delays might change the trade-offs among scheduling methods in a few cases. For
example, in heavily loaded systems with very long communication delays, the eager assignment
method transmits jobs to contractors in advance and thus removes one communication lag from the
round trip flow time. Also, these simulations were only attempting to minimize average flow time,
not any of the other possible global objectives. Nevertheless, we believe that the superiority of
deferring assignment of tasks to machines in multi-machine scheduling environments is a principle
that is likely to be very widely applicable, and that is not yet widely known.
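
The idea of deferring assignment can be illustrated with a deliberately simplified sketch. It is
not the Enterprise simulator: bidding is collapsed into a single shared pool of waiting jobs,
communication is instantaneous, late bids are ignored, and the function name and data layout are
assumptions. A job is bound to a machine only when that machine is idle; waiting jobs are ordered
by estimated processing time, and the fastest idle machine is preferred.

```python
import heapq

def simulate_lazy(jobs, speeds):
    """Mean flow time under a simplified lazy (deferred) assignment rule.

    jobs   -- list of (arrival_time, actual_work, estimated_work)
    speeds -- list of machine speeds
    """
    jobs = sorted(jobs)
    free_at = [0.0] * len(speeds)    # time at which each machine becomes idle
    pending = []                     # heap of (estimate, arrival, work)
    flow = []
    i, now = 0, 0.0
    while i < len(jobs) or pending:
        if not pending:
            now = max(now, jobs[i][0])            # nothing waiting: jump to next arrival
        while i < len(jobs) and jobs[i][0] <= now:
            arrival, work, estimate = jobs[i]
            heapq.heappush(pending, (estimate, arrival, work))
            i += 1
        idle = [m for m in range(len(speeds)) if free_at[m] <= now]
        if idle:
            m = max(idle, key=lambda k: speeds[k])        # fastest idle machine
            _, arrival, work = heapq.heappop(pending)     # shortest estimate first
            free_at[m] = now + work / speeds[m]
            flow.append(free_at[m] - arrival)
        else:
            # every machine is busy: advance to the next completion or arrival
            next_time = min(free_at)
            if i < len(jobs):
                next_time = min(next_time, jobs[i][0])
            now = next_time
    return sum(flow) / len(flow)
```

Estimates are used only to order the waiting jobs; the actual work determines when a machine
becomes free, which is one reason a rule of this kind is insensitive to estimation errors.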


DISTRIBUTED  SCHEDULING  PROTOCOL  DEFINITION 

In this section, we present a detailed definition of the Distributed Scheduling Protocol (DSP) that is
used to implement the lazy assignment method. This protocol can be used by cooperating machines
in a network even when the machines use different programming languages and operating systems.
This section (and the Interlisp implementation of the protocols) describe all the message formats as
lists of fields with no field widths or data types specified. To use the protocol with other languages,
field widths and data types should be specified as well.

The message definitions in Figures 6 and 7 use a modified form of BNF specification. Nonterminal
symbols are enclosed by angle brackets ("< >"), terminal symbols are written without delimiters,
ellipses ("...") are used to indicate lists containing an arbitrary number of 0 or more terms, and
square brackets ("[ ]") indicate comments.

<IPCmessage> -> (<sender> <receiver> <body> IPCControlInfo)

<sender> -> <processID>

<receiver> -> <processID>

<processID> -> (hostName processName)




Figure 7

DSP Message Formats

<DSPmessage> -> <requestforbids> | <bid> | <acknowledgement> | <task> | <result> | <query> |
<status> | <bump> | <cancel> | <cutoff>

<requestforbids> -> (REQUESTFORBIDS <taskID> priority <earliestStartTime> <requirements>
<taskSummary>)

<bid> -> (BID <taskID> <startTime> <completionTime>)

<acknowledgement> -> (ACK <taskID>)

<task> -> (TASK <taskID> priority <taskSummary> taskDescription)

<result> -> (RESULT <taskID> <resultStatus> <runtime> result)

<query> -> (QUERY <taskID>)

<status> -> (STATUS <taskstatus> <startTime> <completionTime>)

<bump> -> (BUMP <startTime> <completionTime>)

<cancel> -> (CANCEL <taskID>)

<cutoff> -> (CUTOFF <taskID> <runtime>)

<taskID> -> (hostName taskName taskCreationTime lastMilestoneTime)

<taskCreationTime> -> systemTime

<lastMilestoneTime> -> systemTime

<earliestStartTime> -> systemTime

<startTime> -> systemTime

<completionTime> -> systemTime

<resultStatus> -> NORMAL | ERROR

<runtime> -> (elapsedTime machineType)

<taskstatus> -> LOCAL | BIDDING | SCHEDULED | DELIVERED | RUNNING | NORMAL |
ERROR | DELETED | CUTOFF

<requirements> -> (<requirement> ... <requirement>)

<requirement> -> REMOTE | (HOSTS <hostName> ... <hostName>) | [other terms to be added]

<taskSummary> -> (<summaryterm> ... <summaryterm>)

<summaryterm> -> (TIME timeEst) | (FILES <fileDescriptionList>) | [other terms to be added]

<fileDescriptionList> -> <fileDescription> | <fileDescription> <fileDescriptionList>

<fileDescription> -> (fileName fileCreationDate fileLoadTimeEst)
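
As a hedged illustration of the list-of-fields style used in Figure 7, the sketch below builds a
request for bids and a bid as nested Python tuples and prints them in a parenthesized form. The
helper names (to_sexpr, task_id, request_for_bids, bid) and all field values are hypothetical; the
protocol itself leaves field widths and data types unspecified.

```python
def to_sexpr(value):
    """Render nested tuples/lists as a parenthesized list of fields."""
    if isinstance(value, (list, tuple)):
        return "(" + " ".join(to_sexpr(item) for item in value) + ")"
    return str(value)

def task_id(host_name, task_name, creation_time, last_milestone_time):
    return (host_name, task_name, creation_time, last_milestone_time)

def request_for_bids(tid, priority, earliest_start, requirements, summary):
    return ("REQUESTFORBIDS", tid, priority, earliest_start, requirements, summary)

def bid(tid, start_time, completion_time):
    return ("BID", tid, start_time, completion_time)

# Example with made-up values: a 30-unit task that must run remotely.
tid = task_id("hostA", "sort", 100, 100)
rfb = request_for_bids(tid, 30, 100, ("REMOTE",), (("TIME", 30),))
print(to_sexpr(rfb))
print(to_sexpr(bid(tid, 100, 130)))
```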


Inter-Process Communication Format

DSP is based on an Inter-Process Communication protocol (IPC) for the transport of messages.
Messages in the IPC are assumed to be reliably delivered without duplication. If the IPC
guarantees that messages are received in the order they are sent, then some redundant processing
will be avoided, but the DSP does not require this guarantee. As specified in Figure 6, messages
are assumed to contain at least the following information: sender, receiver, and body. The sender
and receiver are globally unique processIDs for the sending and receiving processes. The body may
be a DSP message, as defined below, or some other message. The messages may also contain other
IPC control information such as a message identifier, time sent, and information about
acknowledgements and duplicates.

DSP Message Formats

Figure 7 specifies the formats for the different DSP messages. The exact formats for the
taskDescription, taskSummary, priority, requirements, and result fields are implementation dependent,
with the only restriction being that contractors that recognize and satisfy the requirements must also
be able to recognize and deal with the specified priority and the descriptions in the taskSummary
and the taskDescription. Both the requirements and the taskSummary are intended to have new
terms added as the protocol is used for different types of machines and different types of tasks.

TaskIDs are task identifiers that are guaranteed to be unique across time and space. Since a given
task can be restarted during its lifetime (e.g., because of a processor failure), it is also necessary to
distinguish between these different "incarnations" of the same process [35]. In order to do this, a
timestamp of the most recent "milestone event" in the life of the process is included in the taskID.
Milestone events are the sending of either a request for bids or a task message concerning the task.
Both these events render obsolete all previous DSP messages concerning the task. Before
responding to DSP messages about a particular task, therefore, both clients and contractors check to
be sure the message concerns the most recent incarnation of the task. (TaskIDs serve the same
purpose as the call identifiers used by Birrell and Nelson [2].) In Enterprise, both processIDs and
taskIDs use the host's network address as the hostName.
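
One way to perform the incarnation check is sketched below. This is an assumption about how a
client or contractor might keep track of milestones, not a description of the Enterprise code; the
class and field names are hypothetical, and times are plain integers for simplicity.

```python
from typing import NamedTuple

class TaskID(NamedTuple):
    host_name: str
    task_name: str
    creation_time: int
    last_milestone_time: int

class IncarnationTable:
    """Remembers the latest milestone seen for each task and rejects older ones."""

    def __init__(self):
        self._latest = {}    # (host, name, creation time) -> latest milestone seen

    def accept(self, tid):
        key = tid[:3]        # identifies the task independently of its incarnation
        latest = self._latest.get(key)
        if latest is not None and tid.last_milestone_time < latest:
            return False     # message concerns an obsolete incarnation
        self._latest[key] = tid.last_milestone_time
        return True
```

A message carrying an older lastMilestoneTime than one already seen for the same task is ignored,
while a newer milestone replaces the stored one.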

The systemTime is assumed to be the same (within small limits) across machines. ElapsedTime is
measured in the same units as systemTime, and machineType is an implementation dependent
encoding of the different types of machines on the network. Bidders who are ready to start on a
task immediately indicate that fact by using the earliestStartTime specified in the request for bids as
the startTime in their bid, even if that time has already passed. Contractors indicate that they
cannot accept a task at all by returning a bump message whose startTime is 0 or NIL. The
fileDescriptions include a fileCreationDate so contractors can determine if the version of a file they
have loaded is the correct one.

The general constraints on message processing in DSP are described above in the overview of the
scheduling process. Contractors are assumed to bid on or acknowledge only tasks whose
requirements they satisfy. A contractor can only process one remote task at a time, but once a task
has begun execution, it can create any number of other local (or remote) tasks. Different
implementations of DSP can use different policies for priority setting, estimating finish times for
bids, canceling and restarting after late bids, evaluating a client's own bid, and cutting off
excessively long tasks. The only restrictions on these policies stem from the fact that consistent
biases (e.g., "lying") on the part of some clients or contractors in setting priorities or submitting bids
may lead to radically suboptimal schedules. Choices about restarting and cutoff policies also involve
tradeoffs between the amount of computation and communication devoted to scheduling and the
efficiency of the schedules.
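
The message conventions described in this subsection (echoing the earliestStartTime when ready,
refusing with a bump whose startTime is 0 or NIL, and checking the fileCreationDate) can be
sketched as follows. The function names, the use of None to stand for NIL, and the dictionary of
loaded files are assumptions made for illustration.

```python
def bid_start_time(earliest_start, now, busy_until=None):
    """startTime a contractor would place in its bid."""
    if busy_until is None or busy_until <= now:
        # ready immediately: echo the client's value even if it has already passed
        return earliest_start
    return max(busy_until, earliest_start)

def is_refusal(bump):
    """True if a (BUMP startTime completionTime) message means 'cannot accept'."""
    tag, start_time, _completion_time = bump
    return tag == "BUMP" and (start_time == 0 or start_time is None)

def has_correct_version(loaded_files, file_description):
    """Check a loaded file against (fileName fileCreationDate fileLoadTimeEst)."""
    name, creation_date, _load_time_est = file_description
    return loaded_files.get(name) == creation_date
```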

Scheduling in Enterprise

The  initial  version  of  the  Enterprise  system  makes  the  following  choices  in  implementing  the  DSP: 

(1) Task descriptions and results are character strings consisting of arbitrary Interlisp forms
"dereferenced" to the "print names" of the forms.

(2) Task summaries consist of (a) the estimated processing time for the task on a
"standard" processor (in our case a Dolphin processor), and (b) the names and lengths of
the files that must be loaded before the processing can begin. The estimated processing
times for a task are supplied by the programmer, or if no estimate is supplied, a default
value is used.

(3) Priorities are simply the estimated processing time (including the estimated time to load
all files) on a standard processor, with low numbers signifying high priority.

(4)  Requirements  can  include  (a)  "REMOTE"  (as  opposed  to  the  default  which  is 
"REMOTEORLOCAL")  and/or  (b)  a  list  of  acceptable  contractors.  Tasks  that  are  required 
to  be  local   never  use  the  DSP. 

(5) Bids take into account the processor speed of the contractor submitting the bid (relative
to a "standard" processor) and the file loading time for all required files not already loaded
on the contractor's machine. File loading time is estimated as being proportional to file
length.

(6) "Late" bids from bidders who were ready to start as soon as they received the request
for bids are accepted (and the task is canceled on the early bidder's machine) if they are
better than the earlier bid by an amount greater than BidTolerance. All other late bids
are rejected.

(7) A client machine's bid on its own task is processed after a delay of OwnBidDelay. As
an interim measure, its own processing time is inflated by a factor equal to the number of
other processes active on the client machine.

(8) A task (and any subtasks it has created) is cut off if it exceeds its estimated time by a
factor greater than CutoffFactor.

"Gaming" the system. If people supply their own estimates of processing times for their tasks and
these time estimates are also used to determine priority, there is a clear incentive for people to bias
their processing time estimates in order to get higher priority. This incentive to give biased
estimates is counteracted in the current system by the possibility of a job being cut off if it greatly
exceeds its estimated time. In general, this issue of "incentive compatibility" [22] is an important
one in designing any organization that involves human actors.
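
Choices (3), (5), (6), and (8) above can be expressed as a handful of small functions. This is a
sketch under stated assumptions rather than the Interlisp-D implementation: the constant values
are invented, file loading time is taken as a fixed multiple of file length, and processor speed
is assumed to scale only the processing time, not the loading time. BidTolerance and CutoffFactor
are the parameter names used in the text.

```python
LOAD_TIME_PER_UNIT_LENGTH = 0.25   # hypothetical proportionality constant
BID_TOLERANCE = 5.0                # hypothetical value of BidTolerance
CUTOFF_FACTOR = 3.0                # hypothetical value of CutoffFactor

def priority(time_est, files):
    """(3) Estimated time on a standard processor, including loading all files.

    files is a list of (name, length); lower numbers mean higher priority.
    """
    return time_est + LOAD_TIME_PER_UNIT_LENGTH * sum(length for _, length in files)

def bid_completion_time(start, time_est, files, speed, loaded):
    """(5) Completion time for a contractor of the given relative speed."""
    missing = sum(length for name, length in files if name not in loaded)
    return start + time_est / speed + LOAD_TIME_PER_UNIT_LENGTH * missing

def accept_late_bid(current_completion, late_completion, bidder_was_ready):
    """(6) Accept a late bid only from a ready bidder that is better by more than the tolerance."""
    return bidder_was_ready and (current_completion - late_completion) > BID_TOLERANCE

def should_cut_off(elapsed, time_est):
    """(8) Cut off a task that exceeds its estimate by more than CutoffFactor."""
    return elapsed > CUTOFF_FACTOR * time_est
```

Because the priority is derived from the user's own estimate, should_cut_off is what limits the
benefit of understating that estimate.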

Conclusion 

We believe that this paper has made three primary contributions:

First, any designer of a parallel processing computing system, whether the processors are
geographically distributed or not, must solve the problem of scheduling tasks on processors. We
presented a simple heuristic method for solving this problem, and demonstrated with simulation
studies its superiority to two plausible alternatives. The simulation studies highlighted the benefit of
deferring as long as possible the actual assignment of tasks to processors.

The scheduling heuristic we presented has the additional advantage of lending itself very naturally
to a decentralized implementation in which separate decisions made by a set of geographically
distributed processors lead to a globally coherent schedule. To aid future implementers of such
distributed systems, we formalized a language-independent protocol (the Distributed Scheduling
Protocol) for coordinating decentralized scheduling decisions.

Finally, whether scheduling decisions are centralized or decentralized, the Enterprise system points
the way toward a new generation of distributed computing environments in which programmers can
easily take advantage of the maximum amount of processing power and parallelism available on a
network at any time, with little extra cost when there are few extra machines available.